Hierarchies of Indices for Text

Authors

  • Ricardo Baeza-Yates
  • Eduardo F. Barbosa
Abstract

We present an efficient implementation of a recently proposed index for text databases, for the case where the database is stored on secondary storage devices such as magnetic or optical disks. The implementation is built on top of a new and simple index for texts called the PAT array (or suffix array). Because text searching in a large database spends most of its time accessing external storage devices, we propose additional index structures and searching algorithms for PAT arrays that reduce the number of disk accesses. We present two index structures: a two-level hierarchy model that uses main memory and one level of external storage (magnetic or optical devices), and a three-level hierarchy model that uses main memory and two levels of external storage (magnetic and optical devices). In both models, performance improves because most of the higher index levels are stored in faster memories, which reduces accesses to the slowest devices in the hierarchy. Analytical and experimental results are presented for both models. For 160 megabytes of text stored on a CD-ROM disk, the two-level model using 2 megabytes of main memory costs 20% of what the PAT array costs when used as a single level.
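The two-level idea can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say what the in-memory level contains, so the layout below (one short key per fixed-size block of the suffix array) is an assumption, as are the names `TwoLevelIndex`, `block_size`, `key_len`, and `block_reads`. The "disk" is simulated by a Python list, and each block fetch is counted so the effect of the in-memory level is visible.

```python
from bisect import bisect_left


def build_suffix_array(text):
    """Naive construction: sort all suffix start positions (demo only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


class TwoLevelIndex:
    """Hypothetical two-level layout: sampled keys in memory, full array 'on disk'."""

    def __init__(self, text, block_size=4, key_len=4):
        self.text = text
        self.sa = build_suffix_array(text)  # stands in for the on-disk array
        self.block_size = block_size
        self.key_len = key_len
        # In-memory level: a short prefix of the first suffix of every block.
        self.keys = [text[self.sa[i]:self.sa[i] + key_len]
                     for i in range(0, len(self.sa), block_size)]
        self.block_reads = 0  # simulated disk accesses

    def _read_block(self, b):
        """Fetch one block of suffix positions and count it as a disk access."""
        self.block_reads += 1
        start = b * self.block_size
        return self.sa[start:start + self.block_size]

    def search(self, pattern):
        """Return the sorted text positions at which pattern occurs."""
        p = pattern[:self.key_len]
        # The block before the first key >= p may still hold matches.
        first = max(bisect_left(self.keys, p) - 1, 0)
        hits = []
        for b in range(first, len(self.keys)):
            if self.keys[b][:len(p)] > p:
                break  # every suffix from this block on sorts after all matches
            for pos in self._read_block(b):
                if self.text.startswith(pattern, pos):
                    hits.append(pos)
        return sorted(hits)


index = TwoLevelIndex("abracadabra")
print(index.search("abra"), index.block_reads)
```

In this toy setup the keys alone decide which blocks can contain matches, so a search touches only those blocks instead of probing the whole array on the slow device. A real system would also have to account for reading the text itself when verifying candidates, which this sketch ignores.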

Similar resources

Identifying Metaphor Hierarchies in a Corpus Analysis of Finance Articles

Using a corpus of over 17,000 financial news reports (involving over 10M words), we perform an analysis of the argument-distributions of the UP- and DOWN-verbs used to describe movements of indices, stocks, and shares. Using measures of the overlap in the argument distributions of these verbs and k-means clustering of their distributions, we advance evidence for the proposal that the metaphors re...

Mining bilingual topic hierarchies from unaligned text

Recent years have seen an exponential growth in the amount of multilingual text available on the web. This situation raises the need for novel applications for organizing and accessing multilingual content. Common examples of such applications include Multilingual Topic Tracking and Cross-Language Information Retrieval systems. Most of these applications rely on the availability of multilingua...

Politeness Orientation in Social Hierarchies in Urdu

The present research aims to investigate how the politeness of speakers of Urdu is influenced by their relative social status in society. The researcher took the politeness theory of Brown and Levinson (1978, 1987) as a model. To observe the politeness of Urdu speakers, the speech act of apology with different strategies was selected. A Discourse Completion Task (DCT) was used as an instrument to...

Incremental Construction of Topic Hierarchies using Hierarchical Term Clustering

Topic hierarchies are very useful for managing, searching and browsing large repositories of text documents. Hierarchical clustering methods are used to support the construction of topic hierarchies in an unsupervised way. However, the traditional methods are ineffective in scenarios with growing text collections. In this paper, an incremental method for the construction of topic hierarchies...

Refining our Notion of What Text Really Is: The Problem of Overlapping Hierarchies

Introduction OHCO-1 Thesis: Text is an Ordered Hierarchy of Content Objects Arguments Pragmatic Empirical Theoretical Counterexamples: Multiple Logical Hierarchies In the Old (SGML) View Genres Determine Text Objects The New (TEI) View: Perspectives Determine Text Objects Consequences of the Shift: There is no Unique Logical Hierarchy OHCO-2 Thesis: Perspectives Determine OHCOs Counterexamples:...

Bringing Structure to Text: Mining Phrases, Entity Concepts, Topics, and Hierarchies

Mining phrases, entity concepts, topics, and hierarchies from massive text corpus is an essential problem in the age of big data. Text data in electronic forms are ubiquitous, ranging from scientific articles to social networks, enterprise logs, news articles, social media and general web pages. It is highly desirable but challenging to bring structure to unstructured text data, uncover underly...

Publication date: 1996